In classical linear regression, we use independent variables to characterize a dependent variable. In other words, we characterize the conditional mean function of the dependent variable, conditional on the independent variables, relying on the iid assumption.
In univariate TSA, we focus on characterizing a single series whose observations are typically dependent on each other. It’s that dependency structure we want to characterize, and it’s the existence of these dependencies that enables us to forecast (although some dependency models - e.g. the random walk - can’t be usefully forecast).
In theory, a time series begins in the infinite past and continues into the infinite future (this is important for deriving properties of TS models).
A Stochastic Process is a sequence of random variables, indexed by time: a statistical phenomenon that evolves according to probability laws. A time series is a discrete realization of this “data-generating process,” a set of observations generated sequentially in time. The stochastic process’s analogy in classical regression is the underlying (data-generating) population model that we try to discover.
Time indices are \(t_1, t_2, \ldots, t_N\) and observations are \(z(t_1), \ldots, z(t_N)\). If they’re equidistant, we simplify the notation to \(z_1, \ldots, z_N\).
A Discrete-Time Stochastic Process is a sequence of random variables (e.g. \(\{... z_{-1}, z_0, z_1, ... z_N\}\)). A finite subset of this realization (e.g. \(\{z_1, ... z_N\}\)) is called a sample path, which can be considered a collection of observations.
IMPORTANT: in CLR, a cross-section of observations is considered many realizations or draws. In time series analysis, a collection of observations is considered only one draw or one realization. We’re analyzing \(T\) observations, so we have a \(T \times 1\) vector (contrast with the \(n \times k\) matrix for CLR).
Note that there is no error term associated with the stochastic process by itself; the error term is only introduced when trying to break down a time series into deterministic and stochastic components.
NOTE: the lag plot is not part of the EDA. It’s used only to help an audience grasp the intuition behind ACF plots. Show it to executives; to us as data scientists, it holds less information than ACF plots.
Pattern 1: Trend with fluctuation around the trend
Pattern 2: Change in Structure
Pattern 3: Variation around Stable Mean
Pattern 4: Periodicity
A complete description of a TS as a collection of \(n\) random variables requires a joint distribution function:
\[ F(c_1, c_2, \ldots, c_n) = P(x_{t_1} \le c_1, x_{t_2} \le c_2, \ldots, x_{t_n} \le c_n)\]
Characterizing this in its general form is impossible. A single realization of a TS doesn’t offer enough information to characterize the underlying joint distribution function.
One of the most important probabilistic features is the dependency structure embedded in joint distributions.
The mean function of a stochastic process is:
\[ \mu_x(t) = E(x_t) = \int_{-\infty}^{\infty}x_t f_t(x_t)\,dx_t\]
If the function is constant over time, then the underlying stochastic process is said to be stationary in the mean. The variance for a TS that is stationary in the mean is:
\[ \sigma_x^2(t) = E[(x_t - \mu)^2] = \int_{-\infty}^{\infty} (x_t - \mu)^2 f_t(x_t)\,dx_t\]
Note that both the mean and variance are functions of the index (\(t_i\)). However, it’s impossible to estimate the different variances at different points in time with only a single TS (i.e. a single realization of the SP). As a result, another popular assumption is stationarity of variance.
In the context of TSA, we speak of covariance and correlation as between multiple RVs in the same series, so we refer to them as autocovariance and autocorrelation. The autocovariance function (acvf) is:
\[ \gamma_x(s,t) = cov(x_s, x_t) = E[(x_s-\mu_s)(x_t-\mu_t)] \quad \forall s,t\] By properties of covariance, we know that \(\gamma_x(s,t) = \gamma_x(t,s)\) and \(\gamma_x(s,s)=cov(x_s,x_s)=var(x_s)\).
Stationarity of mean, variance, and autocovariance produce a large class of stationary TS models.
If a TS is “second-order stationary,” it is stationary in both mean and variance, so \(\mu_t = \mu\) and \(\sigma_t^2 = \sigma^2\) for all \(t\), and its autocovariance depends only on the lag \(k\), not on \(t\). Then we write the autocovariance function as:
\[ \gamma_k = Cov(x_t, x_{t+k}) = E[(x_t - \mu)(x_{t+k} - \mu)]\] Then the autocorrelation function (acf) is:
\[ \rho_k = \frac{\gamma_k}{\gamma_0} = \frac{\gamma_k}{\sigma^2} \] If \(k=0\), then it follows that \(\gamma_0 = \sigma^2\) and \(\rho_0=1\). The dependence between values of a TS is central to the analysis, so it’s important to estimate autocorrelation with precision.
We use moment principles to estimate values of acvf and acf from their sample equivalents. Note the division by \(T\), the total number of measurements (not \(t\)).
\[ \text{sample acvf: } \hat{\gamma}_k = \frac{1}{T} \sum_{t=1}^{T-k} (x_t - \bar{x})(x_{t+k} - \bar{x})\]
\[ \text{sample acf: } \hat{\rho}_k = \frac{\hat{\gamma}_k}{\hat{\gamma}_0} = \frac{\frac{1}{T} \sum_{t=1}^{T-k} (x_t - \bar{x})(x_{t+k} - \bar{x})}{\frac{1}{T} \sum_{t=1}^{T} (x_t - \bar{x})^2}\]
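These estimators are easy to compute directly as a sanity check. A minimal sketch (the function name and test vector are arbitrary illustrations, not from the text); it should reproduce what R’s built-in acf() returns, since acf() uses the same divide-by-\(T\) convention:
# compute the sample acvf/acf from the moment formulas above
sample_acf <- function(x, k) {
  n <- length(x)                  # n plays the role of T in the formulas
  x_bar <- mean(x)
  gamma_k <- sum((x[1:(n-k)] - x_bar) * (x[(1+k):n] - x_bar)) / n
  gamma_0 <- sum((x - x_bar)^2) / n
  gamma_k / gamma_0
}
# compare with R's built-in estimate at lag 1
x <- rnorm(100)
sample_acf(x, 1)
acf(x, plot=FALSE)$acf[2]         # index 1 holds lag 0, so lag 1 is index 2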
The partial autocorrelation is a conditional correlation: conditional on the other explanatory variables being accounted for in a TS model. The partial autocorrelation of a process \(z_t\) at lag \(k\) (\(\alpha_{kk}\)) is the autocorrelation between \(z_t\) and \(z_{t-k}\), adjusting for the effects of the variables \(z_{t-1}, z_{t-2}, \ldots, z_{t-k+1}\). In other words, if you regressed \(z_t\) on all of the other \(z\) variables, \(\alpha_{kk}\) would be the coefficient on \(z_{t-k}\). (Note: this kind of regression is called an autoregression.)
Or more simply put, the partial autocorrelation at lag \(k\) is the correlation that results after removing the effect of any correlations at shorter lags. The partial autocorrelation at lag \(k\) is the \(k\)th coefficient of a fitted \(AR(k)\) model.
The partial autocorrelation summarizes the dynamics of a process, and as such it’s a powerful tool for identifying the order \(p\) of an \(AR(p)\) model. The PACF plot cuts off abruptly after the lag corresponding to the order of the AR process, independent of the values of the autoregressive parameters.
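This connection is easy to check numerically. A hedged sketch (the series and lag are arbitrary; arima.sim and ar are base R): fit an order-\(k\) autoregression by OLS and compare its last coefficient with the pacf() value at lag \(k\). The two should agree approximately, since they use slightly different estimators:
# verify: pacf at lag k ~ kth coefficient of a fitted AR(k)
x <- arima.sim(model=list(ar=0.6), n=500)   # any series works; AR(1) for illustration
k <- 3
fit <- ar(x, aic=FALSE, order.max=k, method="ols")
fit$ar[k]                    # alpha_kk from the order-k autoregression
pacf(x, plot=FALSE)$acf[k]   # should be approximately the same value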
A time series \(x_t\) is strictly stationary if its joint distribution is unchanged by any time shift. This is too strong an assumption to be practical for modeling. Mathematically:
\[ F(x_{t_1}, \ldots, x_{t_n}) = F(x_{t_1+m}, \ldots, x_{t_n+m}) \quad \forall\, t_1, \ldots, t_n \text{ and } m\] Weak stationarity (a.k.a. second-order stationarity) is more practical. A time series \(x_t\) is weakly stationary if its mean and variance are stationary, and its \(cov(x_t,x_{t+k})\) depends only on \(k\) (so it can be written \(\gamma(k)\)).
Recall that the discrete white noise process (\(w_t\)) is a sequence of RVs indexed by \(t\) that are iid and have:
\[E(w_t) = \mu_w=0\]
\[ \gamma_k = cov(w_t, w_{t+k}) = \begin{cases} \sigma_w^2, & k = 0 \\ 0, & k \ne 0 \end{cases} \]
\[ \rho_k = \begin{cases} 1, & k=0 \\ 0, & k \ne 0\end{cases}\]
# White Noise simulation
sim <- rnorm(500, mean=0, sd=1)
layout(1:1)
hist(sim)
plot(ts(sim))
acf(ts(sim))
pacf(ts(sim)) # note the y-axis scale
As expected, the series appears random, and the ACF shows no significant correlation at any lag.
Discrete white noise is important because we use it as a model for residuals. After fitting other time series models to observed data, we check the residuals against discrete white noise to confirm that we have eliminated any remaining serial correlation, indicating a good model fit.
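A common way to formalize this residual check (not covered above, but standard practice) is the Ljung-Box portmanteau test in base R, whose null hypothesis is that the first several autocorrelations are jointly zero:
# Ljung-Box test on (stand-in) residuals; H0: no serial correlation up to lag 20
resid_example <- rnorm(500)   # placeholder for residuals from a fitted model
Box.test(resid_example, lag=20, type="Ljung-Box")
# a large p-value is consistent with discrete white noise residuals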
Consider a model with a deterministic linear trend: \(x_t = \beta_0 + \beta_1 t + w_t\), where \(w_t\) is a white noise process with mean 0 and variance \(\sigma_w^2\). The expected value of \(x_t\) is:
\[ E(x_t) = E(\beta_0 + \beta_1 t + w_t) = \beta_0 + \beta_1 t + E(w_t) = \beta_0 + \beta_1 t\] \[ Var(x_t) = Var(\beta_0 + \beta_1 t + w_t) = Var(w_t) = \sigma_w^2\]
This means that the expected value (mean) is changing with time, but the variance is not.
To look at correlation:
\[ Cov(x_t, x_{t-1}) = Cov(\beta_0 + \beta_1 t + w_t, \beta_0 + \beta_1 (t-1) + w_{t-1}) = Cov(w_t,w_{t-1})= 0 \] The covariances are zero, but the mean \(\beta_0 + \beta_1 t\) changes with time, so the stochastic model with a deterministic linear trend is not (strictly or weakly) stationary. It can, however, be transformed into a stationary model (shown after the simulation below).
# run a simulation
beta_0 <- 1
beta_1 <- 0.5
sigma_w <- 10
t <- seq(1,500)
w_t <- rnorm(500, 0, sigma_w)
x2 <- beta_0 + beta_1 * t + w_t
x2_ts <- ts(x2)
# plot
layout(1:1)
plot(x2_ts, main="Simulation: Stochastic Model with Deterministic Linear Trend")
lag.plot(x2_ts, lags=9, main="Autocorrelation of Lags")
acf(x2_ts, main="Autocorrelation with Lags")
pacf(x2_ts, main="Partial Autocorrelation")
As expected, we see the mean changes with time, the variance doesn’t, and the ACF plot shows that the series has very high persistence (i.e. it is highly correlated with its lags). The scatterplots also show clear correlation.
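As mentioned above, this series can be made stationary by removing the deterministic trend. A minimal sketch using the objects from the simulation (regression residuals, or alternatively first differences):
# option 1: regress out the trend; the residuals estimate w_t
detrended <- residuals(lm(x2 ~ t))
plot(ts(detrended), main="Detrended Series")
acf(ts(detrended), main="ACF of Detrended Series")
# option 2: first differences (x_t - x_{t-1} = beta_1 + w_t - w_{t-1})
plot(diff(x2_ts), main="First Differences")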
An MA(1) model takes the form: \(x_t = \alpha w_t + \beta w_{t-1}\) where \(w_t\) is a discrete white noise series with mean 0 and variance \(\sigma_w^2\). Sometimes the \(w_{t-1}\) term is referred to as a “shock.” The expected value of \(x_t\) is:
\[ E(x_t) = E(\alpha w_t + \beta w_{t-1}) = \alpha E(w_t) + \beta E(w_{t-1}) = 0\]
The autocorrelation function (for the general \(MA(q)\) case, with coefficients \(\beta_0, \ldots, \beta_q\)) is:
\[ \rho_k = \begin{cases} 1, & k=0 \\ \frac{\sum_{i=0}^{q-k} \beta_i \beta_{i+k}}{\sum_{i=0}^{q} \beta_i^2}, & k=1, \ldots, q \\ 0, & k>q \end{cases}\]
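As a worked example, identify \(\beta_0 = \alpha\) and \(\beta_1 = \beta\) in the formula above. For an MA(1) (\(q=1\)), the lag-1 autocorrelation is:
\[ \rho_1 = \frac{\beta_0 \beta_1}{\beta_0^2 + \beta_1^2} = \frac{\alpha \beta}{\alpha^2 + \beta^2}\]
With the simulation values used below (\(\alpha=1\), \(\beta=0.8\)), this gives \(\rho_1 = 0.8/1.64 \approx 0.49\), and \(\rho_k = 0\) for all \(k > 1\).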
# run a simulation
alpha <- 1
beta <- 0.8
sigma_w <- 10
w_t <- rnorm(500, 0, sigma_w)
x3 <- vector()
x3[1] <- w_t[1]
for (t in 2:500) {
x3[t] <- alpha * w_t[t] + beta * w_t[t-1]
}
x3_ts <- ts(x3)
# plot
layout(1:1)
plot(x3_ts, main="Simulation: Moving Average Model (Order=1)")
lag.plot(x3_ts, lags=9, main="Autocorrelation of Lags")
acf(x3_ts, main="Autocorrelation with Lags")
pacf(x3_ts, main="Partial Autocorrelation")
A distinguishing feature of this model is that the ACF drops off abruptly after lag \(q\) (the order of the model). The scatterplots show some correlation at lag 1, but none after that.
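This theoretical cutoff can be verified with base R’s ARMAacf(), which computes the exact ACF of an ARMA model. Note that R parameterizes the MA(1) as \(x_t = w_t + \theta w_{t-1}\), i.e. it assumes \(\alpha = 1\), which matches our simulation:
# theoretical ACF of the simulated MA(1) (alpha=1, beta=0.8)
ARMAacf(ma=0.8, lag.max=5)
# lag 1 is ~0.49 (matching the worked example above); lags > q = 1 are exactly zero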
These are the most important time series models. They are founded on the idea that the current value of a series can be explained as a function of the \(p\) past values. A stationary autoregressive model of order \(p\), \(AR(p)\), takes the form:
\[ x_t = \alpha_0 + \alpha_1 x_{t-1} + \alpha_2 x_{t-2} + ... + \alpha_p x_{t-p} + w_t\]
where \(x_t\) is a stationary series and the \(\alpha\) values are non-zero unknown parameters to be estimated. \(w_t\) is a Gaussian white noise series with \(mean=0\) and \(variance=\sigma_w^2\). The intercept relates to the process mean via \(\alpha_0 = \mu (1- \alpha_1 - \alpha_2 - \cdots - \alpha_p)\), i.e. \(\mu = \frac{\alpha_0}{1- \alpha_1 - \cdots - \alpha_p}\).
There’s also a technical requirement for invertibility, but AR models always meet it.
The AR(1) model takes the form:
\[x_t = \alpha x_{t-1} + w_t\]
where \(w_t\) is a discrete white noise series with mean 0 and variance \(\sigma_w^2\). By recursive substitution, an AR(1) process can be written as a linear process: an infinite sum of white noise terms:
\[ x_t = \alpha x_{t-1} + w_t = \alpha (\alpha x_{t-2} + w_{t-1}) + w_t = \alpha (\alpha (\alpha x_{t-3} + w_{t-2}) + w_{t-1}) + w_t = ...\]
The result can be written more cleanly as:
\[ x_t = \sum_{i=0}^\infty \alpha^i w_{t-i}\]
provided that \(|\alpha|<1\), in which case \(x_t\) is stationary. Using that form, it’s easy to calculate the expected value of \(x_t\), showing that the \(AR(1)\) model is stationary with mean zero:
\[ E(x_t) = E(\sum_{i=0}^\infty \alpha^i w_{t-i}) = \sum_{i=0}^\infty E(\alpha^i w_{t-i}) = \sum_{i=0}^\infty \alpha^i E(w_{t-i}) = 0\]
The linear form of \(x_t\) also leads to the autocovariance. Because distinct white noise terms are uncorrelated, only the terms with \(j = k+i\) survive:
\[ \gamma_k = Cov(x_t, x_{t+k}) = Cov \bigg(\sum_{i=0}^\infty \alpha^i w_{t-i}, \sum_{j=0}^\infty \alpha^j w_{t+k-j}\bigg) = \sum_{i=0}^\infty \alpha^i \alpha^{k+i} \sigma_w^2 = \alpha^k \sigma_w^2 \sum_{i=0}^\infty \alpha^{2i} = \frac{\alpha^k \sigma_w^2}{1-\alpha^2}\]
From this equation, we can get the autocorrelation function:
\[ \rho_k =\frac{\gamma_k}{\gamma_0} = \bigg[\frac{\alpha^k \sigma_w^2}{1-\alpha^2}\bigg]\bigg[\frac{\sigma_w^2}{1-\alpha^2}\bigg]^{-1} = \alpha^k \quad (\text{for } k \ge 0,\ |\alpha| <1)\]
A theoretical \(AR(1)\) model with positive \(\alpha\) has an ACF that decays exponentially. With negative \(\alpha\), the ACF oscillates between positive and negative values due to the \(\alpha^k\) term (taking the \(k\)th power of a negative number). The smaller \(|\alpha|\), the more quickly it decays.
# simulate shape of the ACF curve for +/- alphas
rho <- function(k, alpha) {
  alpha^k
}
par(mfrow=c(1,2))   # show both curves side by side
plot(0:10, rho(0:10, alpha=0.7), type='b', main='alpha=0.7')
abline(h=0, lty=2)
plot(0:10, rho(0:10, alpha=-0.7), type='b', main='alpha=-0.7')
abline(h=0, lty=2)
Now run a simulation:
# Create a function to simulate AR(1) models
sim_ar1 <- function(alpha, sigma_w=1) {
x <- w_t <- rnorm(500, 0, sigma_w)
for (t in 2:500) {
x[t] <- alpha * x[t-1] + w_t[t]
}
ts(x)
}
# run a simulation and plot
x4_ts <- sim_ar1(alpha=0.9, sigma_w=1)
plot(x4_ts, main="Simulation: AR(1) Model with alpha=0.9")
lag.plot(x4_ts, lags=9, main="Autocorrelation of Lags")
acf(x4_ts, main="Autocorrelation with Lags")
pacf(x4_ts, main="Partial Autocorrelation")
High values of \(\alpha\) result in a persistent model. Note how the time series appears to have windows where it “tends” to go up or down. Note that the exponential decay on the ACF plot behaves like the theoretical curves simulated above. The fact that only the 1st lag is significant on the partial autocorrelation plot is an indicator of the order of the AR model (i.e. \(AR(1)\)).
The dotted lines represent the 95% CI for the ACF. Note that the CI of each of the autocorrelations is the same, independent of lag. This is because of the property of the AR model that both conditional and unconditional variances are constant. The CI (which tightens with sample size \(n\)) is calculated as:
\[ CI: -\frac{1}{n} \pm \frac{2}{\sqrt{n}}\]
While, in theory, points outside the 95% CI would be evidence against the null hypothesis that the correlation at lag \(k\) is zero, we need to be careful about interpreting multiple hypothesis tests: even if all true autocorrelations are 0, about 5% of the estimates will still fall outside the CI by chance.
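For reference, the bounds can be computed directly for our simulated series. (Note: R’s acf() plot actually draws the slightly simpler \(\pm 1.96/\sqrt{n}\) bounds, which are nearly identical for large \(n\).)
# 95% CI bounds for sample autocorrelations under H0: rho_k = 0
n <- 500
c(lower = -1/n - 2/sqrt(n), upper = -1/n + 2/sqrt(n))
# roughly -0.091 and 0.087 for n = 500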
Now let’s look at a smaller value of \(\alpha=0.4\):
# run a simulation and plot
x4_ts <- sim_ar1(alpha=0.4, sigma_w=1)
plot(x4_ts, main="Simulation: AR(1) Model with alpha=0.4")
lag.plot(x4_ts, lags=9, main="Autocorrelation of Lags")
acf(x4_ts, main="Autocorrelation with Lags")
pacf(x4_ts, main="Partial Autocorrelation")
We observe that the persistence is much less apparent; the time series looks more “random.” The ACF also declines much more rapidly.
One last simulation (including positive and negative values for \(\alpha\)):
par(mfrow=c(2,2))
a9_ts <- sim_ar1(alpha=0.9, sigma_w=1)
aneg9_ts <- sim_ar1(alpha=-0.9, sigma_w=1)
a4_ts <- sim_ar1(alpha=0.4, sigma_w=1)
aneg4_ts <- sim_ar1(alpha=-0.4, sigma_w=1)
plot(a9_ts, main="Simulation: AR(1) Model with alpha= 0.9")
plot(a4_ts, main="Simulation: AR(1) Model with alpha= 0.4")
plot(aneg9_ts, main="Simulation: AR(1) Model with alpha= -0.9")
plot(aneg4_ts, main="Simulation: AR(1) Model with alpha =-0.4")
Observe that for models with a positive autoregressive parameter, the higher value (\(0.9\)) shows much more persistence than the lower value (\(0.4\)). For negative parameters, both models are quite volatile, but note that the scale of values for \(\alpha=-0.9\) is much larger than for \(\alpha=-0.4\).
Now let’s look at the ACF plots for these 4 models:
acf(a9_ts, main="Simulation: AR(1) Model with alpha= 0.9")
acf(a4_ts, main="Simulation: AR(1) Model with alpha= 0.4")
acf(aneg9_ts, main="Simulation: AR(1) Model with alpha= -0.9")
acf(aneg4_ts, main="Simulation: AR(1) Model with alpha= -0.4")
As expected, we see the ACF for models with \(\alpha < 0\) alternate positive and negative. Also, the model with \(\alpha=0.9\) shows much more persistence than the model with \(0.4\).
If we were to plot the PACF, all 4 plots would drop to zero after the first lag (confirmed below).
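This is quick to confirm with the four simulated series (the 2x2 layout set above still applies):
pacf(a9_ts, main="PACF: alpha = 0.9")
pacf(a4_ts, main="PACF: alpha = 0.4")
pacf(aneg9_ts, main="PACF: alpha = -0.9")
pacf(aneg4_ts, main="PACF: alpha = -0.4")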
The backward shift operator \(B\) shifts \(x_t\) to \(x_{t-1}\). Using it, we can rewrite the \(AR(1)\) equation:
\[x_t = \alpha x_{t-1} + w_t = \alpha B x_t + w_t\]
\[ (1-\alpha B)x_t = w_t \implies x_t= (1-\alpha B)^{-1} w_t = \sum_{i=0}^\infty \alpha^i B^i w_t = \sum_{i=0}^\infty \alpha^i w_{t-i}\] where \((1-\alpha B)^{-1}\) is expanded as a geometric series (valid for \(|\alpha|<1\)), giving the same result as above.
The Random Walk model is a special case of \(AR(1)\), with \(\alpha=1\).
As with the general \(AR(1)\) case, \(E(x_t)=0\). Writing the walk as an accumulation of shocks, \(x_t = w_1 + w_2 + \cdots + w_t\) (taking \(x_0 = 0\)), the variance is:
\[ Var(x_t) = t \sigma^2\]
The autocovariance (for \(k \ge 0\)) simplifies to:
\[\gamma_k(t) = t \sigma^2\]
Because the autocovariance is a function of time, this model is nonstationary. As a result, a random walk is suitable only for short-term predictions.
The autocorrelation is:
\[\rho_k(t) = \frac{Cov(x_t, x_{t+k})}{\sqrt{Var(x_t)Var(x_{t+k})}} = \frac{t \sigma^2}{\sqrt{t\sigma^2(t+k)\sigma^2}} = \frac{1}{\sqrt{1+k/t}}\]
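Following the pattern of the earlier simulations (the name x5_ts simply continues the numbering), a random walk is easily generated with cumsum():
# run a simulation: random walk x_t = x_{t-1} + w_t, starting from x_0 = 0
sigma_w <- 1
w_t <- rnorm(500, 0, sigma_w)
x5_ts <- ts(cumsum(w_t))
layout(1:1)
plot(x5_ts, main="Simulation: Random Walk")
acf(x5_ts, main="Autocorrelation with Lags")
# the ACF decays very slowly: rho_k(t) = 1/sqrt(1 + k/t) stays near 1 for k << t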
This example comes from Cowpertwait.
url <- 'https://raw.githubusercontent.com/mwinton/Introductory_Time_Series_with_R_datasets/master/wave.dat'
wave_df <- read.table(url, header=TRUE)
wave_ts <- ts(wave_df)
# data is collected at 0.1 sec intervals (396 obs = 39.6 sec)
#layout(1:2)
plot(wave_ts)
abline(h=0, lty=2)
# plot first 60
plot(ts(wave_df[1:60,]))
abline(h=0, lty=2)
Observations: the data doesn’t seem to show any trend or seasonal component, so it should be appropriate to assume it’s a realization of a stationary TS process. We also don’t see any outliers. It appears to fluctuate around a constant mean with approximately constant variance. We see quasi-periodicity, but no fixed frequency.
It’s important to plot the “correlogram” of the acf (and acvf) vs. lag using the R function acf.
# look at ACF
acf(wave_ts)
pacf(wave_ts)
# values are stored in acf(...)$acf
head(acf(wave_ts)$acf, 10)
## [1] 1.00000 0.47026 -0.26291 -0.49892 -0.37871 -0.21499 -0.03792
## [8] 0.17764 0.26932 0.13039
Observe the wavelike shape of the ACF, which resembles a damped \(cos\) function. This is typical of a TS generated by an \(AR(2)\) process.
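If that reading is right, an AR fit with automatic (AIC-based) order selection should choose an order near 2. A quick check, assuming the wave data loaded above:
# let ar() select the order by AIC; an AR(2)-like series should give order ~ 2
wave_fit <- ar(as.numeric(wave_ts))
wave_fit$order
wave_fit$ar     # the fitted autoregressive coefficients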
Another visual way to examine the dependency structure of a series is a scatterplot matrix showing autocorrelation at different lag values \(k=1, 2, \ldots\).
lag.plot(wave_ts, lags=9)
Note that the alternating direction of correlation in these plots agrees with the ACF.
Data downloaded from MacroTrends.net. Note that the async lectures plotted weekly data; this data is monthly.
#plot ts
library(dplyr)   # needed for filter() and mutate() below
unemployment_df <- read.csv('jobless-claims-historical-chart.csv', skip=13, header=TRUE)
unemployment_df <- unemployment_df %>% filter(value>0) %>% mutate(thousands=value/1000)
unemployment_ts <- ts(unemployment_df$thousands, start=c(1967,1), freq=12)
plot(unemployment_ts, xlab="Monthly Series", ylab="First Time Unemployment Claims (in Thousands)")
Observe that the series appears very persistent, showing fairly long-term upward and/or downward trends. This suggests correlation with its own lags is probably high. It also suggests that stationarity of mean, variance, and autocorrelation may not be satisfied.
# plot scm against its own lags
lag.plot(unemployment_ts, lags=9, main="Autocorrelation of First Time Jobless Claims (Monthly)")
# plot acf; manually change axis labels if we don't want it in years
acf(unemployment_ts, type='correlation', main="Autocorrelation - First Time Jobless Claims\n(Each bar represents one month)", xlab="Lag", xaxt='n', lag.max=24)
axis_max <- length(unemployment_ts)
axis(1, at=0:axis_max/12, labels=0:axis_max)
abline(h=0.8, lty=3, col="red")
pacf(unemployment_ts)
The scatterplot matrix shows strong correlation for all 9 months displayed. The ACF plot shows a correlation of over 0.80 for 5 months (the dotted red line shows this arbitrary threshold).
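That count can be checked programmatically by extracting the ACF values instead of plotting them (the 0.8 cutoff is the same arbitrary threshold as above):
# count how many lags (excluding lag 0) exceed the 0.8 threshold
acf_vals <- acf(unemployment_ts, lag.max=24, plot=FALSE)$acf
sum(acf_vals[-1] > 0.8)   # should match the count of months described above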